205 research outputs found

    Testing gene set enrichment for subset of genes: Sub-GSE

    Background: Many methods have been developed to test the enrichment of genes related to certain phenotypes or cell states in gene sets. These approaches usually combine gene expression data with functionally related gene sets as defined in databases such as GeneOntology (GO), KEGG, or BioCarta. The results based on gene set analysis are generally more biologically interpretable, accurate and robust than the results based on individual gene analysis. However, while most available methods for gene set enrichment analysis test the enrichment of the entire gene set, it is more likely that only a subset of the genes in the gene set is related to the phenotypes of interest. Results: In this paper, we develop a novel method, termed Sub-GSE, which measures the enrichment of a predefined gene set, or pathway, by testing its subsets. The application of Sub-GSE to two simulated and two real datasets shows Sub-GSE to be more sensitive than previous methods, such as GSEA, GSA, and SigPath, in detecting gene sets associated with a phenotype of interest. This is particularly true for cases in which only a fraction of the genes in the gene set are associated with the phenotypes. Furthermore, the application of Sub-GSE to two real datasets demonstrates that it can detect more biologically meaningful gene sets than GSEA. Conclusion: We developed a new method to measure gene set enrichment. Applications to two simulated datasets and two real datasets show that this method is sensitive to the associations between gene sets and phenotype. The program Sub-GSE can be downloaded from http://www-rcf.usc.edu/~fsun.

    Assessing the power of tag SNPs in the mapping of quantitative trait loci (QTL) with extremal and random samples

    BACKGROUND: Recent studies have indicated that the human genome could be divided into regions with low haplotype diversity interspersed with regions of high haplotype diversity. In regions of low haplotype diversity, a small fraction of SNPs (tag SNPs) are sufficient to account for most of the haplotype diversity of the human genome. These tag SNPs can be extremely useful for testing the association of a marker locus with a qualitative or quantitative trait locus in that it may not be necessary to genotype all the SNPs. When tag SNPs are used to reduce the genotyping effort in association studies, it is important to know how much power is lost. It is also important to know how much power is gained when tag SNPs instead of the same number of randomly chosen SNPs are used. RESULTS: We design a simulation study to tackle these problems for a variety of quantitative association tests using either case-parent samples or unrelated population samples. First, the samples are generated based on the quantitative trait model with the assumption of either an extremal sampling scheme or a random sampling scheme. Second, a small number of samples are selected to determine the haplotype blocks and the tag SNPs. Third, the statistical power of the tests is evaluated using four kinds of data: (1) all the SNPs and the corresponding haplotypes, (2) the tag SNPs and the corresponding haplotypes, (3) the same number of evenly spaced SNPs with minor allele frequency greater than a threshold and the corresponding haplotypes, (4) the same number of randomly chosen SNPs and their corresponding haplotypes. CONCLUSION: Our results suggest that in most situations genotyping efforts can be significantly reduced by using tag SNPs for mapping the QTL in association studies without much loss of power, which is consistent with previous studies on association mapping of qualitative traits. 
For all situations considered, two-locus haplotype analyses using tag SNPs are more powerful than those using the same number of randomly selected SNPs, but the size of the power difference depends on the sampling scheme and the population history.

    Extreme Value Distribution Based Gene Selection Criteria for Discriminant Microarray Data Analysis Using Logistic Regression

    One important issue commonly encountered in the analysis of microarray data is to decide which and how many genes should be selected for further studies. For discriminant microarray data analyses based on statistical models, such as logistic regression models, gene selection can be accomplished by comparing the maximum likelihood of the model given the real data, $\hat{L}(D|M)$, with the expected maximum likelihood of the model given an ensemble of surrogate data with randomly permuted labels, $\hat{L}(D_0|M)$. Typically, the computational burden for obtaining $\hat{L}(D_0|M)$ is immense, often exceeding the limits of available computing resources by orders of magnitude. Here, we propose an approach that circumvents such heavy computations by mapping the simulation problem to an extreme-value problem. We present the derivation of an asymptotic distribution of the extreme value as well as its mean, median, and variance. Using this distribution, we propose two gene selection criteria, and we apply them to two microarray datasets and three classification tasks for illustration. Comment: to be published in Journal of Computational Biology (2004).

    The effects of protein interactions, gene essentiality and regulatory regions on expression variation

    Background: Identifying factors affecting gene expression variation is a challenging problem in genetics. Previous studies have shown that the presence of a TATA box, the number of cis-regulatory elements, gene essentiality, and protein interactions significantly affect gene expression variation. Nonetheless, obtaining a more complete understanding of such factors and of how their interactions influence gene expression variation remains a challenge. The growth rates of yeast cells under several DNA-damaging conditions have been studied, and a gene's toxicity degree is defined as the number of such conditions under which the growth rate of the yeast deletion strain is significantly affected. Since toxicity degree reflects a gene's importance to cell survival under DNA-damaging conditions, we expect it to be negatively associated with gene expression variation. Mutations in both cis-regulatory elements and transcription factors (TFs) regulating a gene affect the gene's expression, and thus we study the relationship between gene expression variation and the number of TFs regulating a gene. Most importantly, we study how these factors interact with each other in influencing gene expression variation. Results: Using yeast as a model system, we evaluated the effects of four separate factors and their interactions on gene expression variation: protein interaction degree, toxicity degree, number of TFs, and the presence of a TATA box. Results showed that 1) gene expression variation is negatively correlated with the protein interaction degree in the protein interaction network, 2) essential genes tend to have less expression variation than non-essential genes and gene expression variation decreases with toxicity degree, and 3) the number of TFs regulating a gene is the most important factor influencing gene expression variation (R^2 = 8–14%). In addition, the number of TFs regulating a gene was found to be an important factor influencing gene expression variation for both TATA-containing and non-TATA-containing genes, but with different association strengths. Moreover, gene expression variation was significantly negatively correlated with toxicity degree only for TATA-containing genes. Conclusion: The finding that distinct mechanisms may influence gene expression variation in TATA-containing and non-TATA-containing genes provides new insights into the mechanisms that underlie the evolution of gene expression.

    Variance adjusted weighted UniFrac: a powerful beta diversity measure for comparing communities based on phylogeny

    Background: Beta diversity, which involves the assessment of differences between communities, is an important problem in ecological studies. Many statistical methods have been developed to quantify beta diversity, and among them, UniFrac and weighted UniFrac (W-UniFrac) are widely used. W-UniFrac is a weighted sum of branch lengths in a phylogenetic tree of the sequences from the communities. However, W-UniFrac does not consider the variation of the weights under random sampling, which results in less power to detect differences between communities. Results: We develop a new statistic, termed variance adjusted weighted UniFrac (VAW-UniFrac), to compare two communities based on the phylogenetic relationships of the individuals. VAW-UniFrac is used to test whether the two communities are different. To test the power of VAW-UniFrac, we first ran a series of simulations, which revealed that it always outperforms W-UniFrac, as well as UniFrac when the individuals are not uniformly distributed. Next, all three methods were applied to analyze three large 16S rRNA sequence collections, including human skin bacteria, mouse gut microbial communities, and microbial communities from hypersaline soil and sediments, as well as tropical forest census data. Both simulations and applications to real data show that VAW-UniFrac can satisfactorily measure differences between communities, considering not only the species composition but also abundance information. Conclusions: VAW-UniFrac can recover biological insights that cannot be revealed by other beta diversity measures, and it provides a novel alternative for comparing communities.
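As a rough illustration of the statement that W-UniFrac is a weighted sum of branch lengths, the sketch below computes the raw weighted UniFrac value in its commonly used form, sum over branches of b_i * |A_i/A_T - B_i/B_T|. The function name `weighted_unifrac`, the flat per-branch input layout, and the toy numbers are assumptions made for illustration and are not taken from the paper. The variance adjustment that defines VAW-UniFrac is not implemented here.

```python
from typing import Sequence


def weighted_unifrac(branch_lengths: Sequence[float],
                     counts_a: Sequence[int],
                     counts_b: Sequence[int],
                     total_a: int,
                     total_b: int) -> float:
    """Raw weighted UniFrac: sum_i b_i * |A_i / A_T - B_i / B_T|.

    branch_lengths[i] is the length of branch i; counts_a[i] and counts_b[i]
    are the numbers of individuals from communities A and B that descend
    from branch i; total_a and total_b are the community sizes.
    (Hypothetical input layout, chosen for this sketch.)
    """
    if not (len(branch_lengths) == len(counts_a) == len(counts_b)):
        raise ValueError("all per-branch inputs must have the same length")
    if total_a <= 0 or total_b <= 0:
        raise ValueError("community sizes must be positive")
    return sum(
        length * abs(a / total_a - b / total_b)
        for length, a, b in zip(branch_lengths, counts_a, counts_b)
    )


# Toy tree with two branches of lengths 1.0 and 2.0.
# Community A places 3 of 4 individuals under branch 0; community B places 1 of 4.
toy_value = weighted_unifrac([1.0, 2.0], [3, 1], [1, 3], 4, 4)
print(toy_value)  # 1.5
```

Identical communities give a value of 0, and the value grows as relative abundances diverge along long branches.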

    Inference of Markovian Properties of Molecular Sequences from NGS Data and Applications to Comparative Genomics

    Next Generation Sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential. A plausible model for this underlying distribution of word counts is given through modelling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data. Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate them using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species. Comment: accepted by RECOMB-SEQ 201
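To make the idea of word counts over short reads concrete, the sketch below pools k-mer counts across a set of reads (never across read boundaries) and uses the pooled 2-mer counts as a plug-in estimate of a first-order Markov transition matrix. This is a generic illustration under simple assumptions, not the estimators or the order-selection methods proposed in the paper; the function names and the toy reads are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

ALPHABET = "ACGT"


def count_words(reads: Iterable[str], k: int) -> Counter:
    """Count k-letter words within each read and pool the counts over reads."""
    counts: Counter = Counter()
    for read in reads:
        # Reads shorter than k produce an empty range and contribute nothing.
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            # Skip words containing ambiguous bases such as 'N'.
            if all(base in ALPHABET for base in word):
                counts[word] += 1
    return counts


def first_order_transitions(reads: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Plug-in estimate of P(next base | current base) from pooled 2-mer counts."""
    pair_counts = count_words(reads, 2)
    probs: Dict[str, Dict[str, float]] = {}
    for x in ALPHABET:
        row_total = sum(pair_counts[x + y] for y in ALPHABET)
        # A base never seen as a predecessor gets an all-zero row.
        probs[x] = {
            y: (pair_counts[x + y] / row_total if row_total else 0.0)
            for y in ALPHABET
        }
    return probs


toy_reads = ["ACGT", "ACG", "CGA"]  # hypothetical short reads
print(count_words(toy_reads, 2))
print(first_order_transitions(toy_reads)["G"])
```

Because counts are taken inside each read only, the last base of one read and the first base of the next never form a word, which is one reason statistics designed for a single long sequence do not carry over directly.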

    A network-based integrative approach to prioritize reliable hits from multiple genome-wide RNAi screens in Drosophila

    Background: The recently developed RNA interference (RNAi) technology has created an unprecedented opportunity to interrogate the function of individual genes in whole organisms or cell lines at genome-wide scale. However, multiple issues, such as off-target effects or low efficacies in knocking down certain genes, have produced RNAi screening results that are often noisy and that potentially yield high rates of both false positives and false negatives. Therefore, integrating RNAi screening results with other information, such as protein-protein interaction (PPI) data, may help to address these issues. Results: By analyzing 24 genome-wide RNAi screens interrogating various biological processes in Drosophila, we found that, for nearly all screens, RNAi positive hits were significantly more connected to each other within a protein-protein interaction network than expected at random. Based on this finding, we developed a network-based approach to identify false positives (FPs) and false negatives (FNs) in these screening results. This approach relied on a scoring function, which we termed NePhe, to integrate information obtained from both the PPI network and the RNAi screening results. Using a novel rank-based test, we compared the performance of different NePhe scoring functions and found that diffusion kernel-based methods generally outperformed others, such as direct neighbor-based methods. Using two genome-wide RNAi screens as examples, we validated our approach extensively from multiple aspects. We prioritized hits in the original screens that were more likely to be reproduced by the validation screen and recovered potential FNs whose involvement in the biological process was suggested by previous knowledge and mutant phenotypes. Finally, we demonstrated that the NePhe scoring system helped to biologically interpret RNAi results at the module level. Conclusion: By comprehensively analyzing multiple genome-wide RNAi screens, we conclude that network information can be effectively integrated with RNAi results to suggest candidate FPs and FNs, and to bring biological insight to the screening results.

    Inherent Negative Refraction on Acoustic Branch of Two Dimensional Phononic Crystals

    Guided by theoretical predictions, we have demonstrated experimentally the existence of negative refraction on the lowest two (acoustic) passbands (shear and longitudinal modes) of a simple two-dimensional phononic crystal consisting of an isotropic stiff (aluminum) matrix and square-patterned isotropic compliant (PMMA) circular inclusions. At frequencies and wave vectors where the refraction is negative, the effective mass density and the effective stiffness tensors of the crystal can be positive-definite; this is an inherent property of phononic crystals with an isotropic stiff matrix containing periodically distributed isotropic compliant inclusions. The equi-frequency contours and energy flux vectors, as functions of the phase-vector components, reveal a rich body of refractive properties that can be exploited to realize, for example, beam splitting, focusing, and frequency filtration on the lowest passbands of the crystal, where the dissipation is the least. By proper selection of material and geometric parameters, these phenomena can be realized at remarkably low frequencies (large wavelengths) using rather small, simple two-phase unit cells. Keywords: doubly periodic phononic crystals, acoustic branch negative refraction, beam splitting, focusing, imaging, frequency filtration at large wavelength.

    Prioritizing functional modules mediating genetic perturbations and their phenotypic effects: a global strategy

    A strategy is presented to prioritize, among candidate modules, the functional modules that mediate genetic perturbations and their phenotypic effects.